Since it was first published 30 years ago, Chi et al.'s seminal paper on
expert and novice categorization of introductory problems has led to a
plethora of follow-up studies within and outside of the area of physics
[Chi et al., Cognitive Science 5, 121-152 (1981)]. These studies frequently
encompass "card-sorting" exercises whereby the participants group problems.
While this technique certainly allows insights into problem solving
approaches, simple descriptive statistics more often than not fail to find
significant differences between experts and novices. In moving beyond
descriptive statistics, we describe a novel microscopic approach that takes
into account the individual identity of the cards and uses graph theory and
models to visualize, analyze, and interpret problem categorization
experiments. We apply these methods to an introductory physics (mechanics)
problem categorization experiment, and find that most of the variation in
sorting outcome is not due to the sorter being an expert versus a novice,
but rather due to an independent characteristic that we named "stacker"
versus "spreader." The fact that the expert-novice distinction only accounts
for a smaller amount of the variation may explain the frequent null-results
when conducting these experiments.

PACS numbers: 01.40.Fk,01.40.Ha,01.55.+b,01.90.+g 



I. INTRODUCTION 



Physics education is key to the development of new physicists and to the
development of the field through physics research. Besides working in
research labs doing fundamental physics research, many physicists go on to
solve important problems while working in other fields. An effective physics
curriculum prepares students for a diverse array of jobs, which explains in
part why physicists are well known for being resourceful problem solvers.
Larkin et al. concluded that the basis of this problem solving ability is
the array of cognitive connections between multiple concepts, making each
physics concept a part of a coherent whole rather than disparate bits of
information. Fuller points out the importance of a good conceptual
understanding when he says, "Every physicist knows the importance of having
the correct concept in mind before beginning to solve a problem" (emphasis
ours).

Categorization studies comparing experts and novices started with Chi et
al., who studied the categorization of introductory physics problems. This
study, which to date has been cited over 3000 times, has been critical in
the study of the differences between experts and novices in many areas such
as clinical psychology, dinosaur expertise, wine tasting, and even Star Wars
philosophy. All of these studies go back to the same apparently
straightforward result in physics: novices categorize introductory physics
problems by "surface features" (e.g., "incline," "pendulum," or "projectile
motion"), while experts use "deep structure" (e.g., "energy conservation" or
"Newton's second law"). We now take a closer look at previous categorization
studies involving physics problems.



II. PREVIOUS CARD-SORTING STUDIES 



The novice group of Chi et al.'s study was made up of eight students who had
just finished the first semester of an introductory university physics
class, and the expert group was made up of eight advanced Ph.D. physics
students. Both groups were given the instructions to sort the problems
"based on similarity of solution." Problems were allowed to be placed in two
(or more) categories if the sorter so desired; we call this "multiple
categorization," as opposed to "single categorization," where each problem
would have to be sorted in one and only one category.

Each sorter categorized their set in front of a member of the research team
according to a uniform protocol. Sorters were required to sort the problems
without paper and pencil to prevent them from actually solving the problems.
After sorting the problems a second time (to check for consistency), the
sorters explained the reasoning for their groupings. After a qualitative
analysis of the category names used by more than two sorters, Chi et al.'s
group concluded that the key distinction between experts and novices is,
quite sensibly, that experts sort problems based on the physics principle
required to solve each problem, while novices sort the problems based on
surface features. This difference in categorization, Chi et al. concluded,
was due to an expert's ability to convert contextual cues from the problem
texts and figures into the physics principles that are required to solve
those problems. The main message from Chi's paper is that this difference in
categorization behavior allows experts to be better problem solvers than
novices.

In a subsequent study, Veldhuis attempted to verify the result of Chi et al.
Veldhuis had three groups: a novice group of 94 introductory physics
students, an intermediate group of 5 students who had just
TABLE I. Veldhuis's Matrix Method. Deep structures are listed along the
top. Surface features are listed along the left. By "terms" Veldhuis
includes "physical arrangements of objects and literal physics terms" in
the problem text. Veldhuis created this set hoping that experts would group
the problems by column and novices would group the problems by row.

          Newton II (a)   E cons (b)   p cons (c)   L cons (d)
Spring    Prob 16         Prob 2       Prob 4       Prob 9
Ramp      Prob 11         Prob 6       Prob 12      Prob 15
Pulley    Prob 5          Prob 14      Prob 13      Prob 8
Terms     Prob 3          Prob 10      Prob 7       Prob 1

(a) Newton's second law
(b) Energy conservation
(c) Linear momentum conservation
(d) Angular momentum conservation

finished classical mechanics, and an expert group of 20 physics professors,
among whom only 2 had not taught calculus-based physics. Veldhuis created
four different categorization sets, one of which was given to each subject
to categorize according to a protocol similar to that used by Chi et al.'s
group. The first set was created in an attempt to mimic the Chi et al.
problem set, and the second was a control set with a similar collection of
end-of-chapter problems. In contrast, the third and fourth sets were
carefully constructed so that each problem had only a single physics
principle and a single surface feature from a set of principles and surface
features. For example, Table I shows how the third set was constructed by
populating a matrix of four surface and four conceptual features. The fourth
set was also "rigged." It had the same number of cards, but only two surface
and two conceptual features. Veldhuis could not draw a conclusion from the
categorizations of his first two problem sets. However, sets 3 and 4 agreed
with Chi et al. in that experts categorize problems based on physics
principles while novices show a "more complex behavior." Ironically,
Veldhuis observed that distinguishing experts and novices based on surface
features of their categorizations failed unless the desired physics features
(conceptual and surface) were built into the design of the experiment.

More recently, the work done in Singh's group at the University of
Pittsburgh has broadened the application of "card-sorting" to other fields.
Mason and Singh compared students in introductory physics courses with both
physics graduate students and physics faculty. Mason and Singh created two
categorization sets of twenty-four problems each. The first set was created
in an attempt to mimic Chi et al.'s set. Seven problems were taken directly
from Chi et al.'s original set, based on examples given in the paper, while
the remainder of Chi et al.'s original set is apparently lost to history. A
second set was devised because the results from the first set showed "major
differences" with Chi et al.'s data [10,11], which may not be surprising
given Veldhuis's previous results [9]. Each subject, upon reading the
problems, filled in three columns on a response sheet: category name, the
appropriateness of the category name, and the identity of problems that fit
in the category. Mason and Singh then rated each problem's category as
"good," "moderate," or "poor" based on each sorter's description of the
category. A category was considered "good" if it was based on the underlying
physics principles. They then asked a faculty panel to validate their
ratings by following the same procedure on a subset of the categorizations.

Mason and Singh found that novices placed the problems taken directly from
Chi et al.'s original study in "good" categories far less often than they
did on average, and concluded that these problems generally came from topics
that are more difficult for novice students. For example, difficult topics
for novices might have been rotational motion, non-equilibrium applications
of Newton's second law, or the work-energy theorem [10,11]. Mason and Singh
also found that superficial category names were far less prevalent in their
study than in Chi's original study. It is possible that the shift away from
novices' use of superficial category names is due to a change in curricular
focus precipitated by Chi et al.'s result. Contrary to the sharp distinction
found by Chi et al., Mason and Singh found that there was some overlap
between the calculus-based introductory physics students and the graduate
students.

In a follow-up study, Singh asked graduate student teaching assistants to
perform a similar categorization exercise, both as themselves and through
the eyes of their students, and compared both types of categorizations to
those of physics faculty and introductory students. In contrast with Chi et
al., Singh considered the physics faculty as the "true experts" and only
looked at graduate students as a sort of intermediate group. Similar to
Mason and Singh, problem categories were rated as "good," "moderate," or
"poor," validated by a faculty panel. Singh found that the graduate students
acting as introductory students performed better on the categorization task
than did actual introductory students, thus overestimating their students.
Singh also found that the professors performed best on the categorization
task, distinguishing this group from the graduate students acting as
themselves. This suggested that the use of graduate students as an expert
group is not entirely accurate, as their behavior is not truly expert-like.

Finally, in a separate study, Lin and Singh also carried out a
categorization study concerning Quantum Mechanics problems. For this task
the novice group consisted of twenty-two junior and senior physics majors
taking Quantum Mechanics. The expert group consisted of six faculty members.
In contrast to the previous studies mentioned here, Lin and Singh chose to
have a three-member faculty panel evaluate all of the categorizations,
scoring each category as either good, moderate, or poor. In contrast to the
studies of introductory physics problems, in Lin and Singh's study the
expert group had more variability, as even the faculty panel did not see
this task in stark terms. Two of the panel members even said that they
disliked using the terms "good" and "poor" to describe a categorization of
Quantum Mechanics problems; this reservation was not voiced by the raters in
the introductory problem categorization studies. Similarly, the faculty
panel members said that they sometimes preferred another categorization
choice to their own. All of this, Lin and Singh conclude, was due to the
more difficult nature of the problems. In any case, it is clear that no
"ideal" set of groupings existed, and it was impossible to simply assign
some "score" to a given categorization.

In summary, replicating Chi et al.'s seminal experiment is challenging. More
often than not, attempts to repeat it fail, as an informal survey among
physics education researchers indicates; however, such null-results do not
get published. Yet, as a community of physics educators, we hold a firm
belief that deep down there is a significant difference in problem solving
behavior between experts and novices, and that categorization is an
important piece of the puzzle. Quantifying this difference, however, more
often than not remains elusive.



III. METHOD PHILOSOPHY 

While Chi et al.'s method has been the predominant
paradigm for follow-up studies, their methodology is 
based on a certain model of the categorization process. 
Using a different model, one will arrive at a different 
methodology. Given the importance of this experimen- 
tal technique, we believe it is important to understand 
the underlying model and consider alternatives to its as- 
sumptions. 



A. Macroscopic versus Microscopic Cluster 
Comparison 

Chi et al.'s group looked at a processed version of the category names
agreed upon by multiple sorters and counted the number of problems in each
category name. Their analysis does not seem to hinge on the identity of the
problems in each group, merely the number of problems in that group. For
example, if two sorters both used the category name "Conservation of Energy"
but one sorter put problems {1, 3, 5, 7, 9} in that set and the other sorter
put problems {2, 4, 7, 8, 9} in that set, Chi et al.'s analysis would count
that as two people who both used an energy-related variant as a category and
both had five problems in that set. In other words, the sets would be
treated identically. We argue that these two groups should be treated
differently, as they have few identical elements. We believe that, in
addition to these "macroscopic" measures (sizes and names of groups), the
sorting results should also be compared on the "microscopic" level of
individual problems.
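The contrast between the two levels can be made concrete with the two piles from the example above. The following is only a minimal sketch: the function names are ours, and the Jaccard overlap is an illustrative stand-in for a microscopic comparison, not the distance measure used later in this paper.

```python
# The two "Conservation of Energy" piles from the example in the text.
pile_a = {1, 3, 5, 7, 9}
pile_b = {2, 4, 7, 8, 9}


def macroscopic_match(a, b):
    """Compare only pile sizes, ignoring which problems are in each pile."""
    return len(a) == len(b)


def jaccard_overlap(a, b):
    """Fraction of distinct problems that appear in both piles.

    Illustrative choice only (an assumption of this sketch);
    1.0 would mean the two piles are identical.
    """
    return len(a & b) / len(a | b)


same_size = macroscopic_match(pile_a, pile_b)  # both piles hold five problems
shared = sorted(pile_a & pile_b)               # problems common to both piles
overlap = jaccard_overlap(pile_a, pile_b)      # shared / distinct problems

print(same_size, shared, overlap)
```

A purely macroscopic comparison reports a match, while the microscopic view shows that only two of the eight distinct problems are shared.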



B. Deterministic versus Variable Nature of Sorting 

Different understandings of the underlying process of categorization will
lead to different statistical analysis methods. Chi et al. seem to view
categorization as a deterministic process, as evidenced by the
"double-check" step in their experimental method. They see any minor
replication variation as evidence of an underlying method. On the other
hand, one of the phenomena that physics education research has to grapple
with is the variability of learner responses to what appear to be identical
scenarios; see, for example, Frank et al. Rather than interpreting
card-sorting outcomes as reflections of stable theories or beliefs, an
alternative model is that they are based on ad hoc assemblies of simpler
intuitions (similar to "phenomenological primitives" or "resources"), which
are assembled "on-the-fly" and whose particular assembly may vary with
circumstances. There is no reason to expect that card-sorting experiments
are immune to this variability, and one may thus expect that any sorter who
categorizes the same set of problems on separate occasions would return
different results, even though he or she might recognize the problems that
are used. We cannot control the actual mechanisms potentially underlying
these "random" outcomes, but have accounted for the resulting variability in
the choice of our statistical methods. In addition, we use sample-based
statistics to interpret our categorization data, realizing that our sample
is only part of a vastly larger population.



C. Parametric versus Non-Parametric Scoring 

Previous analysis methods describe each categorization individually with a
score, which is either a comparison to an "ideal" categorization set or an
individual "grade" of each set. These methods measure performance on the
categorization task, where the scoring criteria are an input of the
evaluation process: the process starts with assumptions of what properties
an expert categorization will have. It may, however, not be clear what an
"ideal" set is, which in turn makes the scoring somewhat ambiguous. First,
curricular emphasis within any physics program varies over time, as does the
researcher's personal categorization. Therefore, that researcher may rate
the same data differently if he or she were to re-evaluate the same
categorization set again. Second, the experiment will not be repeatable from
one group to another using these methods because each individual
experimenter's ideal categorization of the same set will be different,
possibly creating a large distortion in the analysis. Third, as Lin and
Singh found, as topics become more complex, an expert will express
uncertainty in his or her own choice, sometimes preferring the choice of
another to his or her own. Finally, if one evaluates each categorization
subjectively based on the expected deep structure category for each problem,
one assumes the deep structure vs. surface features distinction rather than
letting that be a conclusion of the statistical analysis. We believe that
any groupings should emerge from the data itself. In other words, the
properties and patterns of what makes a categorization expert-like should be
an output of the experiment. Similar to outcomes from non-parametric
data-mining, it may not always be clear what these characteristics mean, as
they are frequently combinations of many features or latent factors.



D. Visualization of the Data 

Finally, several studies utilized dendrograms to interpret their data, e.g.,
Veldhuis. While dendrograms are intuitive, they are not very stable.
Milligan investigated a number of clustering algorithms and compared them
using Monte-Carlo generated data from a defined, yet synthetic, cluster
model which employed random perturbations. According to Milligan, complete
linkage clustering, a type of dendrogram analysis, struggles to recover
clusters when there are outliers present in the dataset. Another type of
dendrogram analysis, single linkage clustering, is highly sensitive to noise
in the dataset. For these reasons, it is important to pre-process the data
before entering a subset of the sorters, so that the resulting dendrogram is
clear and interpretable. However, interpreting a dendrogram is a subjective
exercise, as each dendrogram will have a unique threshold where the tree has
clustered into groups, yet has not begun to coalesce into a single stem on
the tree. Some dendrograms do not have any distinguishable groups at all. We
desired to have an experimental method that required no pre-processing, with
a reliable and easily interpretable output suitable for further analysis. As
a result, we have chosen a different approach, based on graphs.



E. An Alternative Approach 

Given the above concerns, we explored a different 
model of analyzing and interpreting card-sorting data. 
To describe clustering on an individual problem level, we 
decided to approach the analysis as a network. Instead of 
looking at piles, we decided to look at individual question 
cards (nodes in the network) and relationships (edges, in 
this case due to nodes "being in the same pile"). Net- 
works are well described by graph theory. As the re- 
lationship "being in the same pile" has no direction (if 
problem A is in the same pile as B, then B is in the same 
pile as A), we are looking at undirected graphs. The 
resulting graphs have the advantage of converting an ab- 
stract network into an object that can both be visualized 
and analyzed using an established canon of mathematical 
methods. 



As scientists, we prefer simple explanations to complex ones, and sought to
distinguish experts from novices using the simplest test possible. It is for
this reason that we compare these categorizations' macroscopic features
before continuing on to microscopic features. The key distinction between
the macroscopic and microscopic scales is that the macroscopic scale should
not be sensitive to the identity of the problems, while the microscopic
scale should be highly sensitive to problem identity. In choosing
mathematical methods for further analysis, we were unexpectedly limited by
one feature of Chi et al.'s and subsequent studies: the "multiple
categorization," i.e., the fact that one and the same question card is
allowed to be in more than one pile. This presented a challenge to several
existing algorithms. The key measurement we make is a "distance" measurement
between each pair of categorizations. Given these distances, we used
Principal Components Analysis (PCA) to visualize the data in a few simple
plots.



IV. VISUAL AND MACROSCOPIC 
PROPERTIES OF SAMPLE EXPERIMENTAL 
DATA 

We designed and carried out a card-sorting experiment with physics experts
and novices at Michigan State University (MSU). A total of 18 physics
professors and 23 novices participated in our study. All of the novices had
completed at least the first semester of an introductory physics course at
MSU. We gave each sorter a set of 50 problems to sort based on similarity of
solution. The physics faculty were given the set and allowed to complete the
task at a time of their convenience, while the novices were asked to
complete the task during a window of a few hours in an informally supervised
setting. Each sorter categorized his or her problems and recorded the groups
and group names in a separate packet. Multiple categorization was allowed,
but it was in no way communicated to the individual sorters that this
practice was expected or endorsed.



A. Visualizing Categorizations as Graphs 

Analyzing the experimental data in terms of graphs requires a shift in
conceptualization. As a simple example, consider ten questions categorized
into four categories. Suppose that the first category is Newton's second law
and contains problems {2, 4, 6, 8, 10}. Suppose also that the second
category is conservation of energy and contains problems {1, 3, 5, 7, 9},
the third category is conservation of momentum and contains problems
{2, 3, 5, 7}, and the fourth category is kinematics and contains problems
{1, 4, 9}. At this stage in the process, the names of the categories are
irrelevant. In order to create a graph of categorization data, we represent
the questions (cards) as the nodes and use each category to create a set of
edges.
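To start out, we summarize the categorization information in a matrix T.
This matrix is a Boolean (0|1) table with the items being sorted placed
along the rows and the categories along the columns. For this example
categorization (ten problems, four categories) the T matrix is:

$$
T = \begin{pmatrix}
0 & 1 & 0 & 1 \\
1 & 0 & 1 & 0 \\
0 & 1 & 1 & 0 \\
1 & 0 & 0 & 1 \\
0 & 1 & 1 & 0 \\
1 & 0 & 0 & 0 \\
0 & 1 & 1 & 0 \\
1 & 0 & 0 & 0 \\
0 & 1 & 0 & 1 \\
1 & 0 & 0 & 0
\end{pmatrix}
$$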
This is then converted into a weighted adjacency matrix $X_{ij}$
representing the number of times that item i and item j are in the same
category. Specifically,

$$
X_{ij} = \sum_k T_{ik} T_{jk} \left(1 - \delta_{ij}\right) \qquad (1)
$$

where $\delta_{ij}$ is the Kronecker delta. Note that $X_{ii} = 0$, because
in the context of graph theory a term on the diagonal would draw an edge
from an object to itself. Thus, $X_{ij}$ represents the number of edges that
must be drawn between two vertices i and j on the graph. The graph of this
example is shown in Figure 1.

Also from the weighted adjacency matrix the adjacency matrix $A_{ij}$ can be
derived:

$$
A_{ij} = \min\left(X_{ij}, 1\right) \qquad (2)
$$

We applied this method to the physics problem categorizations created by
each sorter. In doing so, we obtained (i) graphs that we may inspect
visually, (ii) adjacency matrices, which will be useful for the calculation
of certain statistics, and (iii) weighted adjacency matrices, which will be
useful when we consider our distance metric.
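The construction of T, X, and A for the ten-problem example can be sketched in a few lines. This is a minimal illustration in plain Python rather than the R code used for the study; the function and variable names are our own assumptions.

```python
# Example categorization from the text: four piles over ten problems.
categories = [
    {2, 4, 6, 8, 10},  # Newton's second law
    {1, 3, 5, 7, 9},   # conservation of energy
    {2, 3, 5, 7},      # conservation of momentum
    {1, 4, 9},         # kinematics
]
n_items = 10


def membership_matrix(categories, n_items):
    """T[i][k] = 1 if problem i+1 is in category k, else 0."""
    return [[1 if (i + 1) in cat else 0 for cat in categories]
            for i in range(n_items)]


def weighted_adjacency(T):
    """X[i][j] = sum_k T[i][k] * T[j][k], with a zero diagonal (Eq. 1)."""
    n = len(T)
    return [[0 if i == j else sum(a * b for a, b in zip(T[i], T[j]))
             for j in range(n)]
            for i in range(n)]


def adjacency(X):
    """A[i][j] = min(X[i][j], 1) (Eq. 2)."""
    return [[min(x, 1) for x in row] for row in X]


T = membership_matrix(categories, n_items)
X = weighted_adjacency(T)
A = adjacency(X)

# Each undirected edge appears twice in the symmetric matrix A.
n_edges = sum(map(sum, A)) // 2
print(X[0][8], n_edges)  # shared piles for problems 1 and 9; number of edges
```

Indices are zero-based in the code, so `X[0][8]` refers to problems 1 and 9, which share two piles and therefore appear with a thicker edge in Figure 1.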

In order to visualize the graphs seen in Figure 1 as well as the other
categorization graphs throughout this paper, we utilized the igraph package
of the R statistical software. There are currently 13 different algorithms
programmed into R for determining node placement, and each would cause the
same graph to look very different. We initially used the Kamada-Kawai
algorithm; however, we eventually chose the Fruchterman-Reingold algorithm
because it does the best job of illustrating multiple categorization.

Fig. 2 shows the power of the visualization technique:
while our sample data had more than 40 participants 
sorting 50 cards each into any number of piles, flipping 
through the graphs in less than a minute allowed us to 
identify the outliers (such as Sorter 16 in the figure) and 
general features along which to distinguish the sorters. 




FIG. 1. Simple example graph: When two problems are 
in the same category more than once (problems 1 and 9 as 
well as problems 3, 5, and 7 in this example) the edges drawn 
between those two corresponding vertices are thicker. The 
line width of each edge was taken proportional to the square 
of the number of connections between two vertices. 

B. Number of categories 

The "number of categories" is a frequently used macroscopic measure of
card-sorting distributions and is not particular to graph theory. It should
be noted that Chi et al.'s experiment found that experts and novices
created, on average, the same number of categories. In order to extend this,
we perform a test that compares the entire distributions, which includes
differences in skewness or shape. For example, a Gaussian distribution and a
bimodal distribution with the same mean and standard deviation would be
discriminated in our tests, whereas they would not be discriminated when
only comparing averages. In order to compare two distributions, we consider
the Empirical Cumulative Distribution Function (ECDF), which is calculated
from each normalized distribution D(x) as follows:

$$
\mathrm{ECDF}(x) = \int_{-\infty}^{x} D(x')\, dx' \qquad (3)
$$

For the category number distribution, ECDF(x) represents the fraction of
sorters who have x or fewer categories. We used the two-sample
Kolmogorov-Smirnov goodness-of-fit hypothesis test (KS-test). The KS-test
statistic is the maximum absolute difference between two ECDFs. Sample
distributions from the same population have a known KS-test statistic
distribution. This allows for the calculation of a p-value much in the same
way that a p-value is calculated from a t-test. This p-value behaves in the
usual way: If p > 0.05, then the distributions are
FIG. 2. MSU physics study sorter graphs: Displayed from left to right are the categorization graphs for representative
sorters. Sorters 2 and 16 were experts and sorters 20 and 30 were novices. Sorters 2 and 30 did very little multiple categorization;
sorter 20 did a good bit of multiple categorization. Sorter 16 was unique in choosing to categorize each problem between 2 and
3 times.



not statistically different at the 95% confidence level. A KS-test comparing
the ECDFs of the expert and novice number of categories (see Figure 3)
demonstrated no statistically significant difference (p = 0.4793). This
result confirms and expands Chi et al.'s result regarding the average number
of categories for experts and novices. Furthermore, we see that these
distributions are consistent with a binomial distribution.
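The ECDF comparison behind the KS-test can be sketched as follows. This is a minimal illustration only: the category counts are hypothetical and are not the study's data, the function names are our own, and the p-value uses the large-sample Kolmogorov series, which may differ slightly from what a statistics package reports for small samples.

```python
import math


def ecdf(sample, x):
    """Fraction of observations in `sample` that are <= x."""
    return sum(1 for v in sample if v <= x) / len(sample)


def ks_statistic(sample_a, sample_b):
    """Maximum absolute difference between the two ECDFs."""
    points = sorted(set(sample_a) | set(sample_b))
    return max(abs(ecdf(sample_a, x) - ecdf(sample_b, x)) for x in points)


def ks_pvalue_asymptotic(d, n_a, n_b, terms=100):
    """Large-sample approximation to the two-sided p-value.

    Uses the Kolmogorov series 2 * sum_k (-1)^(k-1) * exp(-2 k^2 lambda^2)
    with lambda = d * sqrt(n_a * n_b / (n_a + n_b)).
    """
    if d == 0:
        return 1.0
    lam = d * math.sqrt(n_a * n_b / (n_a + n_b))
    total = 2.0 * sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
                      for k in range(1, terms + 1))
    return min(max(total, 0.0), 1.0)


# Hypothetical numbers of categories per sorter (NOT the study's data).
experts = [5, 6, 7, 7, 8, 9, 10]
novices = [4, 6, 6, 7, 8, 8, 11]

d = ks_statistic(experts, novices)
p = ks_pvalue_asymptotic(d, len(experts), len(novices))
print(d, p)
```

Because the statistic depends on the whole ECDF rather than on the mean alone, two samples with equal averages but different shapes can still be distinguished.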



C. Connectedness



The number of so-called 3-cycles macroscopically describes the connectedness
of a graph, and is the first graph-theoretical measure we apply. A 3-cycle
is a sub-graph of three vertices in which every pair of vertices is
connected by an edge. In our example, shown in Figure 1, one of the 24
3-cycles is the sub-graph including vertices {1, 3, 5} because they are all
connected by (at least) one edge. However, the sub-graph including vertices
{1, 2, 3} is not a 3-cycle because vertex 1 is not connected to vertex 2.
This statistic is related to how often a sorter categorizes cards in
multiple piles. In contrast to the previous example, where 7 of the 10
problems were categorized twice, now consider the following example without
any multiple categorization. Suppose the conservation of energy category has
problems {1, 4, 7, 10}, the Newton's second law category has problems
{2, 5, 8}, and the conservation of momentum category has problems {3, 6, 9}.
In this categorization, where no problems are multiply categorized, there
are only six 3-cycles. As such, the 3-cycle distribution is extremely useful
for analyzing the connectedness of graphs. A KS-test comparing the ECDFs of
expert and novice 3-cycle distributions (see Figure 4) demonstrated no
statistically significant difference (p = 0.1584). This result was expected
because connectedness does not take problem identity into account.
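The two 3-cycle counts quoted above can be checked by brute force. The sketch below is illustrative only; the function names are our own, and the exhaustive loop over all vertex triples is chosen for clarity rather than speed.

```python
from itertools import combinations


def edges_from_piles(piles):
    """Set of unordered problem pairs that share at least one pile."""
    edges = set()
    for pile in piles:
        edges.update(frozenset(pair) for pair in combinations(sorted(pile), 2))
    return edges


def count_3_cycles(piles):
    """Number of vertex triples in which every pair is joined by an edge."""
    edges = edges_from_piles(piles)
    vertices = sorted(set().union(*piles))
    return sum(
        1
        for a, b, c in combinations(vertices, 3)
        if {frozenset((a, b)), frozenset((a, c)), frozenset((b, c))} <= edges
    )


# First example from the text: 7 of the 10 problems are categorized twice.
with_multiple = [{2, 4, 6, 8, 10}, {1, 3, 5, 7, 9}, {2, 3, 5, 7}, {1, 4, 9}]
# Second example from the text: no multiple categorization.
without_multiple = [{1, 4, 7, 10}, {2, 5, 8}, {3, 6, 9}]

print(count_3_cycles(with_multiple), count_3_cycles(without_multiple))
```

Multiple edges between the same pair of problems are collapsed here, so the count corresponds to the adjacency matrix A rather than the weighted matrix X.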
FIG. 3. Distribution of Number of Categories: Here we see the ECDFs of the
number-of-categories distributions for experts and novices separately. The
faculty set is displayed using the dashed curve and the novice set is
displayed using the dotted curve. We also compare these distributions to a
sample (N = 1000) shifted binomial distribution with probability p = 0.204.
A Kolmogorov-Smirnov test comparing these two distributions suggests that
expert and novice categorizations are not distinguishable based on category
number (p = 0.4793), and both are well approximated by the same shifted
binomial distribution.



D. Cliques 

Our next macroscopic test considers the so-called max- 
imum clique size. Cliques quantify maximally connected 
sub-groups. In our context the maximum clique size is 
the size of the largest "pile" that a sorter has created. 
A KS-test comparing the ECDFs of expert and novice 
FIG. 4. Number of 3-cycles: This is the distribution of the number of
3-cycles for experts and novices. A Kolmogorov-Smirnov test suggests that
experts and novices are not distinguishable based on their 3-cycle
distributions (p = 0.1584).



FIG. 5. Maximum Clique Size: This is the distribu- 
tion of the maximum clique size for experts and novices. A 
Kolmogorov-Smirnov test suggests that experts and novices 
are not distinguishable based on their maximum clique size 
distributions (p = 0.0587). 



maximum clique size distributions (see Figure 5) demonstrated no
statistically significant difference (p = 0.0587). Similar to the
connectedness result in the preceding section, this result was expected as
maximum clique size does not take problem identity into account.
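For readers who wish to reproduce the maximum clique size on small examples, the sketch below applies the Bron-Kerbosch algorithm with pivoting to the ten-problem example of Section IV A. The choice of algorithm and all names are our own assumptions for illustration; the source does not state which routine was used for the study.

```python
from itertools import combinations


def neighbors_from_piles(piles):
    """Map each problem to the problems sharing at least one pile with it."""
    nbrs = {v: set() for pile in piles for v in pile}
    for pile in piles:
        for a, b in combinations(pile, 2):
            nbrs[a].add(b)
            nbrs[b].add(a)
    return nbrs


def max_clique_size(piles):
    """Size of the largest clique, via Bron-Kerbosch with pivoting."""
    nbrs = neighbors_from_piles(piles)
    best = 0

    def expand(size, candidates, excluded):
        nonlocal best
        if not candidates and not excluded:
            best = max(best, size)  # the current clique is maximal
            return
        # Pivot on the vertex covering the most candidates to prune branches.
        pivot = max(candidates | excluded,
                    key=lambda u: len(candidates & nbrs[u]))
        for v in list(candidates - nbrs[pivot]):
            expand(size + 1, candidates & nbrs[v], excluded & nbrs[v])
            candidates = candidates - {v}
            excluded = excluded | {v}

    expand(0, set(nbrs), set())
    return best


with_multiple = [{2, 4, 6, 8, 10}, {1, 3, 5, 7, 9}, {2, 3, 5, 7}, {1, 4, 9}]
without_multiple = [{1, 4, 7, 10}, {2, 5, 8}, {3, 6, 9}]

print(max_clique_size(with_multiple), max_clique_size(without_multiple))
```

In both examples the result equals the size of the largest pile, in line with the interpretation given above.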



E. Diameter 

The so-called diameter is a macroscopic measure that describes the
number of jumps it takes to get between the two least connected points,
that is, the longest of all shortest paths in the graph. An example of
this statistic is the so-called maximum Erdős number, which says that
many mathematicians can be connected to Paul Erdős in 8 steps or less
by assuming that two mathematicians are connected if they have
collaborated on at least one project. As such, the diameter distribution
is extremely useful for comparing the maximum relative sizes of graphs.
As most of our graphs are unconnected (not every pair of nodes has a
path between them), this introduces a difficulty of how to determine the
diameter. While some would choose to define the diameter to be the
number of nodes in the graph +1 (or 51 in our case), we chose to ignore
all unconnected nodes. This was done to ensure the largest possible
variation in our data. If we had made the former choice, the ECDF would
have (nearly) looked like a step function, which would have given the
distributions an artificial look and caused the differences in the data
distributions to be almost entirely determined by the outliers, rather
than the group
as a whole. A KS-test comparing the ECDFs of expert and novice diameter
distributions (see Figure 6) demonstrated no statistically significant
difference (p = 0.6432). This result was also expected as diameter does
not take problem identity into account.

FIG. 6. Diameter: This is the distribution of the diameter of the
experts and novices. A Kolmogorov-Smirnov test suggests that experts and
novices are not distinguishable based on their diameter distributions
(p = 0.6432).

FIG. 7. Average Path Length: This is the distribution of the average
path length for experts and novices. A Kolmogorov-Smirnov test suggests
that experts and novices are not distinguishable based on their average
path length distributions (p = 0.3906).

F. Average Path Length 

The average path length is a macroscopic measure that 
describes the average number of jumps it takes to get be- 
tween all unique pairs of points. As such, the distribution 
of average path lengths may be used to compare the aver- 
age relative sizes of the different graphs. The calculation 
of the average path length is subject to the same diffi- 
culty due to unconnected graphs as is the diameter. In 
this case, we chose to set the path length between un- 
connected nodes to be 51, rather than ignoring them. In 
this setting, we feel that this measure includes both the 
local structure of the graph and a measure of how un- 
connected the graph is as well. As a result, we note that 
the range of average path length is much larger than the 
diameter. However, a KS-test comparing the ECDFs of 
expert and novice average path length distributions (see 
Figure 7) demonstrated no statistically significant difference
(p = 0.3906). This result, combined with all of the previous results,
suggests that our hypothesis that expert and novice categorizations
cannot be distinguished without taking problem identity into account
has merit.
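
The two conventions for unconnected graphs described above (ignoring unconnected pairs for the diameter, and charging a fixed penalty for them in the average path length) can be sketched in Python as follows. This is an illustration under assumptions: the adjacency-list representation, the function names, and the toy graph are hypothetical, and the penalty of "number of nodes + 1" mirrors the value of 51 quoted in the text.

```python
from collections import deque


def bfs_distances(adjacency, source):
    """Hop counts from source to every node reachable from it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def diameter_ignoring_unconnected(adjacency):
    """Longest shortest path, taken over connected pairs only."""
    return max(max(bfs_distances(adjacency, s).values()) for s in adjacency)


def average_path_length(adjacency, penalty):
    """Mean shortest-path length over all unique pairs.

    Pairs with no connecting path contribute `penalty` instead of being skipped.
    """
    nodes = list(adjacency)
    total = 0
    for index, a in enumerate(nodes):
        dist = bfs_distances(adjacency, a)
        for b in nodes[index + 1:]:
            total += dist.get(b, penalty)
    n_pairs = len(nodes) * (len(nodes) - 1) // 2
    return total / n_pairs


# Toy graph: a path 0-1-2-3 plus a separate pair 4-5.
toy = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2], 4: [5], 5: [4]}
toy_diameter = diameter_ignoring_unconnected(toy)
toy_apl = average_path_length(toy, penalty=len(toy) + 1)
```

In the toy graph the eight unconnected pairs dominate the average path length, which illustrates why that measure spans a much larger range than the diameter.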

V. CATEGORIZATION MODELS 

All of the macroscopic statistical measures, that is, 
measures which dealt with just the groups of cards and 
not the individual cards and their identities, yielded no 
significant distinction between expert and novice sorters. 
For now, visualizing the data was successful in quickly 



recognizing outliers (subjects who sort differently), but 
those outliers were not necessarily more prevalent among 
experts or novices. We now aim to construct a model of 
the categorization process that has the same macroscopic 
and visual properties as our sample experimental data. 
Along the way, we learn more about human behavior 
during categorization tasks. 

We started out by using two standard models fre- 
quently used in graph theory literature. Unfortunately, 
neither of these two standard models reproduces the 
data, in spite of the fact that they are generally considered
complementary. We thus created our own model,
which generated more realistic model data. 



A. Standard Erdős-Rényi and Barabási Models

An Erdős-Rényi model generates a "uniform" graph, that is, a graph where
any two vertices have a certain fixed probability of being connected.
Uniform graphs may be generated as random realizations of a model having
two parameters: the number of nodes and the probability that nodes will
connect. Barabási graphs, a kind of "small-world" graph often used to
model social networking connections, are created by adding one node at a
time and connecting this new node with the existing nodes on the graph
with a probability related to the number of edges already connected to
each node, $P \propto N^{a} + b$. The model for a Barabási graph has
three parameters: the number of nodes in the graph, the probability to
connect to a node with no other connections ($b$), and the power ($a$)
by which the number of edges already connected to a node ($N$) is
raised. We describe next the statistical comparison and analysis of
graphs generated by these models to the graphs generated by our human
sorters.

First, we considered the Erdős-Rényi model. See Figure 8 for examples of
Erdős-Rényi graphs. In order to determine the best input parameters for
our model we optimized these parameters using the standard algorithm
"optim" found in R. This was done by calculating 1000 random graphs from
the Erdős-Rényi model using test parameters and calculating the 3-cycle
distribution from those graphs. This distribution was then compared to
the combined expert and novice 3-cycle distribution from our experiment
and we calculated the KS-test statistic for those two distributions.
Ultimately, the parameters that we determined through this optimization
for the Erdős-Rényi model were the ones that produced the minimum
KS-test statistic between the sorter distribution and the Erdős-Rényi
model distribution. See the left panel of Figure 9 for a comparison of
the ECDFs for these 3-cycle distributions. Even though the optimization
was done specifically for the 3-cycle distribution, the minimum KS-test
statistic corresponded to $p < 10^{-6}$. We also compared the sorter
distributions to the Erdős-Rényi model distribution for maximum clique
size, diameter, and average path length for these optimized parameters.
In every case we found $p < 10^{-6}$ and therefore the Erdős-Rényi model
with optimized parameters does not statistically describe the sorter
data. Next we considered the Barabási model; see Figure 8 for examples
of Barabási graphs. We repeated the same optimization process for the
Barabási model parameters and also compared the sorter distributions for
3-cycles, maximum clique size, diameter, and average path length. In
every case we found $p < 10^{-6}$ and therefore the Barabási model does
not statistically describe the sorter data. Due to the difficulty that
these canonical models have in describing the sorters' behavior we chose
to create our own model, which we call the Cognitive Categorization
Model (CCM).
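
The model side of this comparison needs two ingredients: random graphs and their 3-cycle counts. A minimal Python sketch for the Erdős-Rényi case is given below. It is not the authors' R code; the function names, the 50 nodes, the edge probability of 0.1, and the reduced sample of 100 graphs (the text uses 1000) are assumptions chosen for illustration.

```python
import random
from itertools import combinations


def erdos_renyi(n_nodes, p_edge, rng):
    """Adjacency sets for a graph where each pair is linked with probability p_edge."""
    adjacency = {v: set() for v in range(n_nodes)}
    for a, b in combinations(range(n_nodes), 2):
        if rng.random() < p_edge:
            adjacency[a].add(b)
            adjacency[b].add(a)
    return adjacency


def count_3_cycles(adjacency):
    """Number of triangles, i.e. triples of mutually connected nodes."""
    total = 0
    for a, b in combinations(adjacency, 2):
        if b in adjacency[a]:
            # Only common neighbours with a larger label are counted,
            # so every triangle a < b < c is counted exactly once.
            total += sum(1 for c in adjacency[a] & adjacency[b] if c > b)
    return total


# Model 3-cycle distribution for one trial parameter value (illustrative).
rng = random.Random(1)
model_counts = [count_3_cycles(erdos_renyi(50, 0.1, rng)) for _ in range(100)]
```

The list `model_counts` plays the role of the model 3-cycle distribution whose ECDF is compared with that of the sorters for each trial parameter value.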

B. Cognitive Categorization Model (CCM) 

As standard models failed to reproduce our experimen- 
tal data in a satisfactory way, we constructed our own 
model, which is directly based on the rules of the cate- 
gorization experiment: 

1. All questions must be put into a category. 

2. All categories must have at least one question in 
them. 

3. A question may fall into more than one category. 

The last rule is mathematically cumbersome, but had to be included
since it is standard procedure in most experiments, including the one
that was the basis of our sample data in the previous section.

Our new model, which we call the Cognitive Categorization Model (CCM),
has three parameters. The first parameter of the CCM ($Q$) represents
the number of questions that are being categorized in the experiment.
The second parameter is the average number of categories determined by a
sorter. As we described in Subsection IV B, a shifted binomial
distribution fits the category number data rather well. A binomial
distribution is the "weighted coin" distribution: if you flip a weighted
coin $N$ times, what is the probability that you will get "heads" $k$
times? In principle, one can flip a coin $N$ times and get tails every
time. By Rule #1, we do not want to allow zero categories, therefore we
must introduce a shift. It would also be senseless to create more
categories than questions, so we wish to choose a number of categories
between 1 and $Q$. The simplest way to do this is to generate a number
from the binomial distribution between 0 and $Q - 1$ and then add 1 to
each of these randomly generated values. The probability of success is
chosen to correspond to the final average number of categories. The
final parameter is the probability to categorize a card into more than
one pile. After each problem has been sorted into a single pile, the
algorithm tests whether that problem should be sorted into other
categories as well. Our model, with 2 free parameters, is on par with
the Erdős-Rényi model (1 free parameter) and the Barabási model (2 free
parameters). In addition to the fact that the CCM parameters are
interpretable, the small number of CCM parameters makes this model
parsimonious.
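
The shifted binomial draw for the number of categories can be sketched in Python as follows. This is a minimal sketch under assumptions: the function name is hypothetical, and the success probability is set to $(\bar{C} - 1)/(Q - 1)$, which is one way to make the expected number of categories equal the target average $\bar{C}$.

```python
import random


def sample_category_count(n_questions, mean_categories, rng):
    """Draw C = 1 + Binomial(Q - 1, p), with p chosen so that E[C] = mean_categories."""
    trials = n_questions - 1
    p_success = (mean_categories - 1) / trials
    successes = sum(1 for _ in range(trials) if rng.random() < p_success)
    return 1 + successes  # the shift guarantees at least one category


# Illustrative values: 50 questions, an average of 10 categories.
rng = random.Random(7)
draws = [sample_category_count(50, 10.0, rng) for _ in range(2000)]
sample_mean = sum(draws) / len(draws)
```

By construction every draw lies between 1 and $Q$, so the model never produces zero categories or more categories than questions.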

Appendix A shows the pseudocode for this model. In our code, we
implement multiple categorization by generating a random number between
zero and one and comparing that number to our multiple categorization
probability. However, there are a number of ways that we can model the
multiple categorization probability. The simplest way is to allow every
sorter to have a uniform probability and say that some percentage of the
time a card will be split again. So for this model the multiple
categorization probability is constant:

$$P_{\text{multiple}} = \beta_1 \tag{4}$$

where $\beta_1$ is a constant between zero and one which applies to the
entire population. Another way that we consider assumes that a penalty
is incurred whenever a card is split:

$$P_{\text{multiple}} = \beta_2^{N} \tag{5}$$

where $\beta_2$ is a constant between zero and one which applies to the
entire population and $N$ is the number of times that a problem has
already been categorized by a random sorter. Finally, we consider a
model where the multiple categorization probability depends on the
number of categories ($C$) that a sorter has selected, which was
determined by the binomial distribution:

$$P_{\text{multiple}} = \beta_3^{C} \tag{6}$$

where $\beta_3$ is a constant between zero and one which applies to the
entire population. The differences between these three choices are so
subtle that we cannot see a difference between them by eye using the
graphical representation.

In order to determine best-fitting parameters for each of the models we
considered, we minimized the KS-test statistic between the data 3-cycle
distribution and the model 3-cycle distribution. For the CCM, we used a
simple brute-force grid search instead of the standard optimization
algorithm found in the R statistical software. The reason for this
difference was that the 3-cycle distribution was better approximated
with smaller sample sizes for the two standard graph theory models.
Running the standard optimization algorithm for the larger sample sizes
required by the CCM, however, took much longer, and the brute-force
method quickly became preferable as we could use smaller sample sizes to
get some coarse-grained resolution. We then used larger sample sizes
when we got close to the end result. Once we obtained optimized
parameters for the different CCMs, we compared them (see Figure 10) to
the human sorters based on the 3-cycle distribution.
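
The coarse-to-fine strategy described above (a cheap, coarse grid first, then finer grids with larger samples near the optimum) can be sketched in Python as follows. This is an assumption-laden illustration: the function names, the grid resolution, the sample-size schedule, and in particular `toy_objective` are hypothetical. In the actual procedure the objective would be the KS-test statistic between the simulated and observed 3-cycle distributions.

```python
def grid(low, high, steps):
    """Evenly spaced values from low to high, inclusive."""
    return [low + (high - low) * k / (steps - 1) for k in range(steps)]


def coarse_to_fine_search(objective, bounds, steps=11, sample_sizes=(100, 300, 1000)):
    """Minimize objective(x, y, n_samples) on a grid that shrinks every round."""
    (x_min, x_max), (y_min, y_max) = bounds
    x_lo, x_hi, y_lo, y_hi = x_min, x_max, y_min, y_max
    best = None
    for n_samples in sample_sizes:  # later rounds use larger simulated samples
        best = min((objective(x, y, n_samples), x, y)
                   for x in grid(x_lo, x_hi, steps)
                   for y in grid(y_lo, y_hi, steps))
        _, x_best, y_best = best
        # Shrink the search box to one grid spacing around the current best point.
        dx = (x_hi - x_lo) / (steps - 1)
        dy = (y_hi - y_lo) / (steps - 1)
        x_lo, x_hi = max(x_best - dx, x_min), min(x_best + dx, x_max)
        y_lo, y_hi = max(y_best - dy, y_min), min(y_best + dy, y_max)
    return best


def toy_objective(mean_categories, beta, n_samples):
    """Deterministic stand-in for a KS statistic; it ignores n_samples."""
    return (mean_categories - 9.3) ** 2 + (beta - 0.62) ** 2


score, mean_hat, beta_hat = coarse_to_fine_search(
    toy_objective, [(1.0, 50.0), (0.0, 1.0)])
```

With a stochastic objective, the larger sample sizes in later rounds reduce the noise in the statistic exactly where fine resolution is needed.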

We found that the best fitting CCM had a multiple categorization
probability that depended on the number of categories as seen in
Equation (6). Figure 11 shows some example CCM graphs displayed with
optimized input parameters for the model that we eventually select.
These CCM graphs look much more like the graphs of the human sorters
seen in Fig. 2.

FIG. 9. 3-cycle distributions: On the left is an ECDF of the 3-cycle
distribution for sorters and the Erdős-Rényi model. On the right is the
corresponding ECDF for the sorters and the Barabási model. In both
graphs the dashed line corresponds to the appropriate random model while
the solid line corresponds to the sorter data.

The success of this model gives us insight into the behavior of our
sorters: the probability to categorize a single problem in multiple
categories is different for each person, and that probability actually
decreases with the number of categories that are created. This model
prediction is supported by an observation we made of our sorters. We
observed two different sorter behaviors while they worked on the
categorization task. Some people seemed resolved to make as few piles as
possible; we will call these people "stackers." Stackers were more
likely to put a problem in multiple categories, deciding that putting a
problem into two piles was a better decision than making a new category.
As a result, a stacker's groups tend to be large and inclusive. The
other group of people would spread the problems out on the table that
they were working on; we will call these people "spreaders." Spreaders
were less likely to put a problem in multiple categories, deciding that
making a new category was a better decision than putting a problem into
two piles. As a result, a spreader's groups tend to be small and
exclusive.



FIG. 10. 3-cycle distributions: Here we see the 3-cycle distributions
for the different CCMs. The model that fits the best is v3, where the
multiple categorization probability depends on the number of categories
[Equation (6)].

VI. PRINCIPAL COMPONENTS OF THE
DISTANCE METRIC

A. Distance Metric

We now create a distance metric as a microscopic measure to compare two
sorters to each other. There are several existing distance metrics that
will compare two different sortings, including statistical indices such
as the Rand index, which can be converted into a distance metric.
However, in searching for existing statistical methods that will work
for our categorization exercise, we found none that obeyed the rules of
our "categorization game," especially the third rule. The Rand index
merely counts the number of "agreements." This may be calculated for any
two categorizations. However, similarity indices that are not corrected
for chance agreements are not as reliable for creating a sort of
measuring stick for measuring a "distance" between two categorizations
[25, 26]. For this reason Hubert and Arabie created an adjustment to the
Rand index. However, this adjustment requires that the sub-groups are
disjoint, eliminating any utility that the adjusted Rand index has for
our study and other similar studies which allow multiple categorization.
This story may be repeated for any one of the other statistical indices
that we could find in the statistics literature, and after some
consideration, we decided that we needed to invent a new method for
analyzing this type of data.

Our distance metric will bypass this difficulty as it is a direct
distance metric and not a similarity index. The distance metric is
determined by considering the weighted adjacency matrix for each
reviewer, and it compares any two graphs generated by two reviewers as
long as they have the same number of nodes (which they would for
identical card sets). Each element of the weighted adjacency matrix
$X^{r}$ for each reviewer $r$ is:

$$X^{r}_{ij} = \text{number of edges between problems } i \text{ and } j \tag{7}$$

The distance metric is:

$$d_{rs} = \frac{1}{2} \sum_{i=1}^{Q} \sum_{j=1}^{Q} \left| X^{r}_{ij} - X^{s}_{ij} \right| \tag{8}$$

Our distance metric may be interpreted as the number of edges that need
to be added to and removed from one graph to make it identical to
another. The factor of $\frac{1}{2}$ is included due to the symmetry of
the weighted adjacency matrix. In order to be a statistical distance
metric, the distance $d$ between any two categorizations $X^{r}$ and
$X^{s}$ must satisfy a few properties:

$$\begin{aligned}
d_{rs} &\ge 0 \\
d_{rs} &= 0 \iff X^{r} = X^{s} \\
d_{rs} &= d_{sr} \\
d_{rt} &\le d_{rs} + d_{st}
\end{aligned} \tag{9}$$

Appendix B contains a brief proof that our metric satisfies these
properties. We can use this distance metric to create a symmetric matrix
where the distance between sorter $i$ and sorter $j$ appears in row $i$
and column $j$.
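
The weighted adjacency matrix and the edge-counting distance between two sorters can be sketched in Python as follows. The representation of a sorting as a list of piles (sets of problem indices), the function names, and the three toy sorters are assumptions made for illustration. Two problems are taken to share one edge for every pile that contains both, and no self-edges are counted.

```python
def weighted_adjacency(piles, n_problems):
    """X[i][j] = number of piles that contain both problem i and problem j."""
    x = [[0] * n_problems for _ in range(n_problems)]
    for pile in piles:
        for i in pile:
            for j in pile:
                if i != j:
                    x[i][j] += 1
    return x


def distance(x_r, x_s):
    """Half the summed absolute difference of two weighted adjacency matrices."""
    total = sum(abs(a - b)
                for row_r, row_s in zip(x_r, x_s)
                for a, b in zip(row_r, row_s))
    return total / 2


# Three hypothetical sorters categorizing four problems (0..3).
sorter_r = weighted_adjacency([{0, 1}, {2, 3}], 4)
sorter_s = weighted_adjacency([{0, 1, 2}, {3}], 4)
sorter_t = weighted_adjacency([{0, 1}, {1, 2, 3}], 4)  # problem 1 is in two piles

sorters = [sorter_r, sorter_s, sorter_t]
distance_matrix = [[distance(a, b) for b in sorters] for a in sorters]
```

For example, turning the graph of `sorter_r` into that of `sorter_s` requires removing the edge 2-3 and adding the edges 0-2 and 1-2, giving a distance of 3.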



FIG. 11. Representative CCM graphs: These are some representative graphs
for the CCM using optimized input parameters. Qualitatively they match
up much better to the sorter graphs seen in Figure 2.

B. Principal Component Analysis

The distance matrix we constructed answers the question "How far is
sorter i from sorter j?" for every pair of sorters. Since we have 41
sorters, this matrix operates in a 41-dimensional space, which is of
course impossible to visualize. Principal Component Analysis (PCA) is a
way of reducing high dimensional data back down to something more
manageable. PCA is a general term in the statistical community
describing a number of techniques involving the singular value
decomposition. To visualize our data, we did a singular value
decomposition on the distance matrix. By applying the singular value
decomposition to the distance matrix we perform a change of basis so
that the largest amount of variation is in the first principal component
(PC1), the second largest amount of variation is in the second principal
component (PC2), and each subsequent component explains less variation
than the previous component. This analysis is then projected onto fewer
spatial dimensions, using only the most influential basis vectors as a
new reduced basis.

If this method is successful, the majority of variation can be explained
with just a few components. In our case, taking just the first two
components explains approximately 87% of the variability in our dataset.
We thus focus on this reduced-dimension PCA, which can easily be
visualized in Figure 12. We can now visualize our sorters and easily
interpret what we see. However, the question of what sorter
characteristic results in what behavior of PC1 and PC2 is lost in a
41-dimensional rotation and subsequent projection. In other words, this
abstract representation of microscopic data (the distance matrix
strongly depends on problem identities) does not boil down to a simple
linear combination of macroscopic features.
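
The projection onto the first two principal components can be sketched in Python as follows. The text does not specify centring or scaling, so this sketch assumes one common variant: each row of the distance matrix is treated as an observation, columns are centred, and the leading eigenvectors of the covariance matrix are found by power iteration with deflation (equivalent to a singular value decomposition of the centred data). All function names are hypothetical, and power iteration converges slowly when eigenvalues are nearly equal.

```python
def center_columns(rows):
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[value - mean for value, mean in zip(row, means)] for row in rows]


def covariance(centered):
    n = len(centered)
    cols = list(zip(*centered))
    return [[sum(a * b for a, b in zip(ca, cb)) / (n - 1) for cb in cols] for ca in cols]


def mat_vec(matrix, vec):
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


def dominant_eigenpair(matrix, iterations=1000):
    """Power iteration for the largest eigenvalue of a symmetric PSD matrix."""
    vec = [1.0 / (k + 1) for k in range(len(matrix))]  # fixed start vector
    norm = sum(v * v for v in vec) ** 0.5
    vec = [v / norm for v in vec]
    for _ in range(iterations):
        nxt = mat_vec(matrix, vec)
        norm = sum(v * v for v in nxt) ** 0.5
        if norm < 1e-12:  # nothing left to extract
            return 0.0, vec
        vec = [v / norm for v in nxt]
    return sum(v * w for v, w in zip(vec, mat_vec(matrix, vec))), vec


def first_two_components(rows):
    """Return (PC1, PC2) scores per row and the fraction of variance they explain."""
    centered = center_columns(rows)
    cov = covariance(centered)
    size = len(cov)
    total = sum(cov[k][k] for k in range(size))
    value1, axis1 = dominant_eigenpair(cov)
    # Deflation: remove the first component before searching for the second.
    deflated = [[cov[a][b] - value1 * axis1[a] * axis1[b] for b in range(size)]
                for a in range(size)]
    value2, axis2 = dominant_eigenpair(deflated)
    scores = [(sum(r * a for r, a in zip(row, axis1)),
               sum(r * a for r, a in zip(row, axis2))) for row in centered]
    explained = (value1 + value2) / total if total > 0 else 0.0
    return scores, explained


# Toy input: three points on a line, so a single component carries all variance.
line_scores, line_explained = first_two_components([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
```

Applied to a sorter-by-sorter distance matrix, `scores` would give the coordinates plotted in a PC1 versus PC2 diagram, and `explained` corresponds to the cumulative relative importance of the first two components.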

Making sense of PC1 and PC2 is where the previous work on graph
visualization (Subsection IV A) and analysis (Subsections IV B through
IV F), combined with the interpretation of the CCM (Subsection V B),
comes together. We can look at the relative placement of our sorters by
the PCA, visually analyze their graphs, and attempt an interpretation of
the abstract sources of variation found by the PCA. The expert and
novice identity of each sorter is a variable known only to us and not a
factor in determining the placement of the sorters by the PCA.

Analyzing the sorters in order of increasing PC1-coordinate (Fig. 12,
left panel) shows that this coordinate does not distinguish experts from
novices. In other words, most variance in the data is not related to the
expert or novice identity of the sorters. Instead, when analyzing the
graphs associated with the subjects, it turns out that PC1 mostly
reflects the "stacker" versus "spreader" behavior identified through our
CCM (Subsection V B), which is quite independent of being an expert or a
novice. Based on this result, one could argue that card-sorting
experiments most strongly measure how individuals sort, and may thus be
more reflective of what that individual's office or the file system of
his or her personal computer looks like than whether or not he or she is
a physics expert.

The expert/novice distinction only shows up in PC2. Going along the
PC2-axis in the left panel of Figure 12, one finds more experts with a
high PC2 and more novices with a low PC2. Why that is cannot be answered
at this point and is the subject of further research in our group.



VII. CONCLUSION 

In our endeavor to study the categorization behavior 
of experts and novices, we have developed a method for 
analyzing expert and novice categorizations. In the pro- 
cess, we have gained insight into different human cog- 
nitive structures. Rather than focusing on qualitative 
differences in category names, we chose to focus on the 



groupings of problems. In order to do this we have cre- 
ated a method for converting an abstract categorization 
into a graph, which may then be analyzed. This conver- 
sion has laid the foundation for our method of analyzing 
card-sorting experiments, which is applicable in any ex- 
periment where sorters may put any single card into more 
than one category, a behavior which we name multiple 
categorization. 



Using experimental data, we confirmed the null-results 
that experts and novices are not distinguishable based 
on macroscopic features of their card-sorting such as the 
number of categories. This held true even when employing graph
theoretical approaches. Having found these null results when comparing
the macroscopic properties of categorizations, we created the Cognitive
Categorization Model,
which provided insight into the general sorter behavior. 
We found that the best fitting CCM had a multiple cat- 
egorization probability that depended on the number of 
categories which led us to determine that sorters tended 
toward "stacking" or "spreading" when sorting physics 
problems. A stacker tended to create a few general cat- 
egories and multiply categorize more often. A spreader 
tended to create many specific categories and multiply 
categorize less often. This stacker vs. spreader behavior 
is quite independent of the expert vs. novice distinction 
between our sorters. 



As macroscopic properties did not differentiate expert 
from novice, we studied the microscopic properties of cat- 
egorizations in creating our distance metric. This dis- 
tance metric compares sorters' categorizations in a man- 
ner which takes problem identity into account. In or- 
der to visualize the relative position of our sorters as 
measured by our distance metric, we employed Principal 
Components Analysis. This allowed us to confirm the 
stacker vs. spreader distinction as the largest source of 
variation among sorters. It also fortuitously found the 
distinction between experts and novices as the second 
largest source of variation.

FIG. 12. PCA of the sorter data: On the left we see the PCA plot of the
sorters. PC1 is the coordinate along the first principal axis, and PC2
is the coordinate along the second principal axis. Sorters known by us
to be experts are marked by circles while sorters known by us to be
novices are marked by triangles. Each point is labeled on the left by
the sorter number. The second principal component discriminates experts
from novices. On the right is a plot of the cumulative relative
importance of each subsequent principal component.



VIII. OUTLOOK

In future studies, we will continue the process of making sense of the
source of variation in the second principal component. As the distance
metric is sensitive to the identity of the problems, we plan to study
the effects of including only a subset of problems from our original
categorization set on the Principal Components Analysis. Ongoing work in
our lab has shown that the expert/novice distinction is highly sensitive
to the problems selected. There are some subsets of problems where the
expert/novice distinction is stark and many others where this
distinction is non-existent. In the future, we will be looking at the
properties of the problems which cause this high level of
differentiation with the goal of understanding what "rigging" must occur
in order to observe this stark distinction with a high degree of
certainty.

ACKNOWLEDGMENTS

The authors would like to thank the MSU physics faculty and the
introductory physics classes in the Fall 2010/Spring 2011 semester for
volunteering to sort problems for our study. We would also like to thank
the anonymous reviewers of this journal for their helpful input and
suggestions.

Appendix A: Categorization Model Pseudocode

The following pseudocode creates a weighted adjacency matrix for a
random categorization according to our categorization model. This matrix
may then be used to create a graph. Adding to the utility of the
weighted adjacency matrix is the fact that many graph theory statistics
are calculated using the weighted adjacency matrix or the adjacency
matrix (which is a boolean version of the weighted adjacency matrix).



# Pseudocode for categorization model graph creation:
for each graph
    Q = input parameter       # number of questions
    beta = input parameter
    Cbar = input parameter    # avg. number of categories
    C = random deviate from shifted binomial distribution
    Pmult = beta^C            # multiple sorting probability

    # Create boolean T matrix; rows are questions, columns are categories
    Initialize T
    X = randomize question numbers
    Y = shuffle list of category numbers from 1 to C

    # Rule #2: Every category must be used
    for all j in 1 to C
        T(X(j), Y(j)) = 1

    # Rule #1: All questions must be categorized at least once
    Z = sample the list from 1 to C with replacement Q-C times
    for all j in 1 to (Q-C)
        T(X(C+j), Z(j)) = 1

    # Rule #3: Each question may be categorized more than once
    for all zero elements left in the T matrix
        if (random number from 0 to 1 < Pmult) T(element) = 1

    # Convert T matrix into adjacency matrix (adj) where
    adj(i,j) = T(i,:) dot T(j,:)
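
The pseudocode can be turned into a runnable Python sketch along the following lines. This is not the authors' implementation. The function name and the parameter values are hypothetical, the multiple sorting probability is assumed to be `beta ** C`, the shifted binomial uses a success probability of (Cbar - 1)/(Q - 1), and the diagonal of the adjacency matrix is set to zero here (no self-edges), whereas the plain dot product would give the number of categories containing each question.

```python
import random


def ccm_sample(n_questions, mean_categories, beta, rng):
    """One random categorization: returns (membership matrix T, weighted adjacency)."""
    q = n_questions
    # Shifted binomial: C = 1 + Binomial(Q - 1, p), so that 1 <= C <= Q.
    p = (mean_categories - 1) / (q - 1)
    c = 1 + sum(1 for _ in range(q - 1) if rng.random() < p)
    p_mult = beta ** c  # ASSUMPTION: probability decreases with the number of categories

    t = [[0] * c for _ in range(q)]  # t[question][category]
    questions = list(range(q))
    rng.shuffle(questions)
    categories = list(range(c))
    rng.shuffle(categories)

    # Every category receives one distinct question, so no category is empty.
    for j in range(c):
        t[questions[j]][categories[j]] = 1
    # Every remaining question is placed into one randomly chosen category.
    for question in questions[c:]:
        t[question][rng.randrange(c)] = 1
    # Each still-empty (question, category) slot is filled with probability p_mult.
    for row in t:
        for k in range(c):
            if row[k] == 0 and rng.random() < p_mult:
                row[k] = 1

    # adj[i][j] = number of categories shared by questions i and j (zero diagonal).
    adj = [[0 if i == j else sum(a * b for a, b in zip(t[i], t[j]))
            for j in range(q)] for i in range(q)]
    return t, adj


# Illustrative parameter values only.
membership, adjacency = ccm_sample(50, 10.0, 0.8, random.Random(3))
```

Repeating the call with fresh random draws yields a sample of model graphs whose 3-cycle distribution can be compared with that of the human sorters.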
Appendix B: Distance metric

The following distance metric quantifies the number of edges that must
be added or removed from a graph to make it identical to another graph:

$$d_{rs} = \frac{1}{2} \sum_{i=1}^{Q} \sum_{j=1}^{Q} \left| X^{r}_{ij} - X^{s}_{ij} \right| \tag{B1}$$



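where $X^{r}_{ij}$ is the element in the weighted adjacency matrix for
reviewer $r$. The properties of a metric are as follows:

$$\begin{aligned}
d_{rs} &\ge 0 \\
d_{rs} &= 0 \iff X^{r} = X^{s} \\
d_{rs} &= d_{sr} \\
d_{rt} &\le d_{rs} + d_{st}
\end{aligned} \tag{B2}$$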



The first property is clearly satisfied because we are summing
non-negative numbers. The second condition is satisfied because the only
way that $d_{rs} = 0$ is if every element of the two weighted adjacency
matrices is identical, and if both weighted adjacency matrices are
identical, then $d_{rs} = 0$. The third condition is also met due to the
symmetry of the absolute value:

$$d_{rs} = \frac{1}{2} \sum_{i=1}^{Q} \sum_{j=1}^{Q} \left| X^{r}_{ij} - X^{s}_{ij} \right| = \frac{1}{2} \sum_{i=1}^{Q} \sum_{j=1}^{Q} \left| X^{s}_{ij} - X^{r}_{ij} \right| = d_{sr} \tag{B3}$$



Finally, we will consider the last condition. First, we will consider
the definition of the metric:

$$d_{rt} = \frac{1}{2} \sum_{i=1}^{Q} \sum_{j=1}^{Q} \left| X^{r}_{ij} - X^{t}_{ij} \right|$$

Next we will utilize the additive identity to insert the $X^{s}_{ij}$
terms into the absolute value.

$$d_{rt} = \frac{1}{2} \sum_{i=1}^{Q} \sum_{j=1}^{Q} \left| X^{r}_{ij} - X^{s}_{ij} + X^{s}_{ij} - X^{t}_{ij} \right|$$

Next, we continue with the triangle inequality.

$$d_{rt} \le \frac{1}{2} \sum_{i=1}^{Q} \sum_{j=1}^{Q} \left( \left| X^{r}_{ij} - X^{s}_{ij} \right| + \left| X^{s}_{ij} - X^{t}_{ij} \right| \right)$$

Now we distribute the term in front of the sum.

$$d_{rt} \le \frac{1}{2} \sum_{i=1}^{Q} \sum_{j=1}^{Q} \left| X^{r}_{ij} - X^{s}_{ij} \right| + \frac{1}{2} \sum_{i=1}^{Q} \sum_{j=1}^{Q} \left| X^{s}_{ij} - X^{t}_{ij} \right|$$

And then we simplify using the definition of our metric.

$$d_{rt} \le d_{rs} + d_{st} \tag{B4}$$

So we have shown that this is a metric.